Home Credit Default Risk

GROUP-05-HCDR


Team and project meta information

Members:

members.png


Project Abstract

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition.

The challenge is to construct a model that can predict the level of risk associated with an individual loan. With this project, we intend to use historical loan application data to predict whether or not a borrower will be able to repay a loan.

Based on our comparison of results from phase 1, the following two algorithms were shortlisted: Logistic Regression and Random Forest.

In phase 1, we faced issues related to data size, unwanted data, and a lack of data tuning. In phase 2, our main goal is to add feature engineering (which includes optimizing the data by removing missing values, merging correlated data, and implementing one-hot encoding) and hyperparameter tuning (which includes tuning each algorithm by choosing the optimal set of parameters using GridSearchCV) to the phase 1 algorithms.

Implementing feature engineering and hyperparameter tuning to the models gave us the following results:


Project Description (tasks and data)

Data

Dataset link: https://www.kaggle.com/c/home-credit-default-risk/data

Background of the data

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise be unable to obtain loans or would fall victim to untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview

There are 7 different sources of data:

home_credit.png

Tasks

IMG_CA3A840ED50E-1.jpeg

Importing all the necessary python libraries:
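A minimal sketch of the core imports this notebook relies on (pandas, NumPy, Matplotlib, and seaborn; the modeling and metrics imports appear in their own sections below):

```python
# Core libraries for data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```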

Reading the csv data files:
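A sketch of the reading step, assuming the competition csv files sit in a local data/ folder (the file names are those provided by the Kaggle HCDR competition):

```python
# Load the main application tables plus the auxiliary tables used later
app_train = pd.read_csv("data/application_train.csv")
app_test = pd.read_csv("data/application_test.csv")
bureau = pd.read_csv("data/bureau.csv")
previous_application = pd.read_csv("data/previous_application.csv")
```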


Exploratory Data Analysis + Feature Engineering

EDA, or exploratory data analysis, is an essential component of any Data Analysis or Data Science project. Essentially, EDA entails analyzing the dataset to identify patterns, anomalies (outliers), and hypotheses based on our understanding of the dataset.

EDA is primarily used to see what the data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of the dataset's variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate.

Data description using pandas dataframe

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

Training data information

Testing data information
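For reference, a minimal way to print this information with pandas (using the app_train/app_test names from the reading step above):

```python
# Shape plus per-column dtype and non-null count for each dataframe
print(app_train.shape, app_test.shape)
app_train.info()
app_test.info()
```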

Feature Extraction - Phase 1

By creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset. The new reduced set of features will be able to summarize much of the information that was contained in the original set of features. Thus, an abridged version of the original features can be created by combining them.

In our analysis of the data, we found that there are many missing values. Columns with more than 25% of missing values were removed. Our team checked the columns for the distribution of 0's and removed the columns with 85% of rows with only 0's. In addition, we divided the data into numerical and categorical data. The numerical data was handled by creating an intermediate imputer pipeline in which the missing values were replaced with the mean of the data, while the missing values in categorical missing data were handled by encoding the data based upon OHE (One Hot Encoding) and replacing the missing values with the mode of the columns.

Firstly, let's find the percentage of the missing values in each column:
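A short sketch of this computation (variable names other than app_train are illustrative):

```python
# Percentage of missing values per column, largest first
missing_pct = app_train.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).head(20))
```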

Using 25% as the missing-value threshold, we keep all the columns whose missing percentage is less than the threshold:
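For example, continuing from the missing_pct series computed above:

```python
# Keep only the columns whose missing-value percentage is below the 25% threshold
threshold = 25
kept_columns = missing_pct[missing_pct < threshold].index
app_train = app_train[kept_columns]
```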

To optimize the data, we check each column for all the zero or null values and if 85% or more of the data in that column is filled with zero or null, we remove that particular column:

Printing all the columns that contain at least 85% of its data as either zero or null:

Dropping all the columns that contain at least 85% of its data as either zero or null:
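One way this check-and-drop step can be written (a sketch, not necessarily the exact notebook code):

```python
# Fraction of rows per column that are either zero or null
zero_or_null_pct = ((app_train == 0) | app_train.isnull()).mean()

# Columns where 85% or more of the rows are zero or null
cols_to_drop = zero_or_null_pct[zero_or_null_pct >= 0.85].index
print(list(cols_to_drop))

# Drop those columns from the training data
app_train = app_train.drop(columns=cols_to_drop)
```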

Saving all the training data targets in the numerical dataframe:

Checking the correlation of each numerical feature with the target and keeping those with an absolute correlation greater than 3%, whether positive or negative:
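A hedged sketch of this filter; numerical_df stands in for the numerical dataframe built above, and TARGET is the label column of the HCDR training data:

```python
# Correlation of each numerical feature with the target, keeping |corr| > 0.03
correlations = numerical_df.corr()["TARGET"]
selected = correlations[correlations.abs() > 0.03].drop("TARGET")
print(selected.sort_values())
```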

Checking for data that contains no missing values in the categorical dataframe:

Dropping all the categorical columns that have missing values:


Visual Exploratory Data Analysis

In order to obtain a deeper understanding of the data, EDA involves generating summary statistics from the numerical data and creating various graphical representations. Data visualization presents text or numerical data in a visual format, which makes it easier to grasp the information the data expresses. Since we humans remember pictures more easily than text, Python provides various libraries for data visualization such as Matplotlib, seaborn, and Plotly. In this project, we use Matplotlib and seaborn to explore the data through various plots.

Visual EDA on numerical data

Numerical data refers to the data that is in the form of numbers, and not in any language or descriptive form.

Here we plot graphs of some columns that are positively correlated with the target variable and analyze the trends:

Here we plot graphs of some columns that are negatively correlated with the target variable and analyze the trends:

Plotting a heatmap to analyze correlations in the application train dataset:

Plotting a heatmap to see correlations in the application test dataset:

Visual EDA on categorical data

Categorical data refers to a data type that can be stored and identified based on the names or labels given to them.

From this graph we can figure out whether a person owns realty or not.

From the above graph we can see that the largest borrowing category is people from the working class.

From the above graph we can see the types of loans people take.

From the above pie chart we can see that married people tend to borrow more money.


Modeling Pipelines

Now comes the fun part. In a statistical sense, models are general rules. Think of machine learning models as tools in your toolbox: you have access to many algorithms and use them to accomplish different goals. The better the features you use, the better your predictive power will be. After cleaning the data and finding which features matter most, using the model as a predictive tool will only enhance your decision making.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction). A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

The modeling pipeline is an important tool for machine learning practitioners. Nevertheless, there are important implications that must be considered when using them. The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

Therefore, for this project we are going to use 3 different modeling pipelines to perform home credit default risk prediction: Naive Bayes, Logistic Regression, and Random Forest.

We will choose the model that gives the best accuracy for the home credit default risk prediction.

We will be using the following pipeline for this project:

IMG_7E3EED8EF24A-1.jpeg

Importing all the necessary python libraries for the different pipelines we are going to use:

Selecting only the columns that we finally decided on for the numerical and the categorical parts:

In the pipeline for numerical data (numerical_pipeline), we impute the missing values with the mean of each column; in the pipeline for categorical data (categorical_pipeline), we impute missing values with the most frequent value and then apply one-hot encoding to deal with the categorical features.

Then we have created a pipeline to merge numerical and categorical pipelines using ColumnTransformer.
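A minimal sketch of the pipelines described above; numerical_cols and categorical_cols stand in for the column lists selected earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numerical columns: impute missing values with the column mean
numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
])

# Categorical columns: impute with the most frequent value, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Merge the two pipelines into a single preprocessing step
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_pipeline, numerical_cols),
    ("cat", categorical_pipeline, categorical_cols),
])
```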

Feature Extraction - Phase 2

Saving transformed dataset used for the model training:

We are merging the different dataframes together according to the data diagram shown in the data description section above using primary keys:

We are merging training dataset with the bureau after cleaning the bureau dataset:
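As an illustration of the join (the exact cleaning and aggregation in our notebook may differ), bureau rows can be aggregated to one row per client and merged onto the applications using the SK_ID_CURR key:

```python
# Aggregate the numeric bureau columns to one row per client, then left-join
bureau_numeric_agg = bureau.groupby("SK_ID_CURR").mean(numeric_only=True).reset_index()
app_train = app_train.merge(bureau_numeric_agg, on="SK_ID_CURR", how="left")
```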

Defining a function to find target correlation with the other features:
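A possible version of such a helper (a sketch; the notebook's own function may be written differently):

```python
def target_correlation(df, target_col="TARGET"):
    """Return each numerical feature's correlation with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target_col].drop(target_col)
    return corr.sort_values(key=abs, ascending=False)
```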

We additionally add a few hand-crafted features to the training dataset, as follows:

Finding the correlation between the newly made features and the target feature:

We are shortlisting all the features with a correlation value of greater than 8% with respect to target:

Using the shortlisted features for the training dataset and splitting the whole dataset into training and testing sets:
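A hedged sketch of the shortlist-and-split step, reusing the target_correlation helper sketched above (the 80/20 split ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Keep features whose absolute correlation with TARGET exceeds 0.08
corr = target_correlation(app_train)
shortlisted = corr[corr.abs() > 0.08].index.tolist()

X = app_train[shortlisted]
y = app_train["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```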

We have performed the following experiments with different groups of the newly created features. With hyperparameter tuning and datasets that include these features, we calculated the accuracies of the different models, and then identified the best group of features from these experiments.

WhatsApp%20Image%202022-04-19%20at%209.21.33%20PM.jpeg

Hyperparameter Tuning

Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often times, we don't immediately know what the optimal model architecture should be for a given model, and thus we'd like to be able to explore a range of possibilities. In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters which define the model architecture are referred to as hyperparameters and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.

We can use GridSearch to tune the hyperparameters. Grid Search uses a different combination of all the specified hyperparameters and their values and calculates the performance for each combination and selects the best value for the hyperparameters. This makes the processing time-consuming and expensive based on the number of hyperparameters involved. In GridSearchCV, along with Grid Search, cross-validation is also performed. Cross-Validation is used while training the model. As we know that before training the model with data, we divide the data into two parts – train data and test data. In cross-validation, the process divides the train data further into two parts – the train data and the validation data.
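For example, a GridSearchCV run over a small logistic regression grid might look like this (the grid values are illustrative, not the exact ones we searched):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"C": [0.01, 0.1, 1, 10]}            # regularization strengths to try
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```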

Naive Bayes

Library: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Naive Bayes is a basic but effective probabilistic classification model in machine learning that is derived from Bayes' theorem.

Bayes' theorem is a formula that gives the conditional probability of an event A happening given that another event B has previously happened. Its mathematical formula is as follows:

image.png

Where

A and B are two events.
P(A|B) is the probability of event A given that event B has already happened.
P(B|A) is the probability of event B given that event A has already happened.
P(A) is the independent probability of A.
P(B) is the independent probability of B.

Now, Bayes' theorem can be used to generate the following classification model:

image.png

Where

X = x1, x2, x3, ..., xN is the list of independent predictors.
y is the class label.
P(y|X) is the probability of label y given the predictors X.

The above equation may be extended as follows:

image.png
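Written out in LaTeX (a reconstruction of the formulas shown in the images above):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad
P(y \mid x_1, \dots, x_N) = \frac{P(y)\,\prod_{i=1}^{N} P(x_i \mid y)}{P(x_1, \dots, x_N)}$$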

We are not considering the Naive Bayes model from phase 2 onwards because its accuracy was the lowest during phase 1.

Logistic Regression

Library: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurrence.

It is a special case of linear regression where the target variable is categorical in nature. It uses a log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event utilizing a logit function.

Linear Regression Equation:

image.png

where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.

Sigmoid Function:

image.png

Apply Sigmoid function on linear regression:

image.png
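In LaTeX form (a reconstruction of the equations shown in the images above), the linear model, the sigmoid, and their composition are:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$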

Properties of Logistic Regression:

The dependent variable in logistic regression follows a Bernoulli distribution.
Estimation is done through maximum likelihood.
There is no R-squared; model fitness is instead assessed through concordance and KS statistics.

Random Forest

Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.

Random forest works on the bagging principle. Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the dataset, so each model is built from a sample drawn from the original data with replacement (a bootstrap sample); this row sampling with replacement is called bootstrapping. Each model is then trained independently and produces its own result. The final output is obtained by majority voting after combining the results of all the models; this step of combining the results and generating the output by majority vote is known as aggregation.

Steps involved in random forest algorithm:

image.png


Results and Discussion of Results

In practice, different kinds of metrics are used to evaluate models. The choice of metric depends entirely on the type of model and its implementation plan. Once a model has been built, multiple metrics can be used to help evaluate its accuracy.

Metrics

For this project, we are going to use the following performance metrics for each of the training models separately:

image.png

image.png

where:
y_ij indicates whether sample i belongs to class j or not
p_ij indicates the probability of sample i belonging to class j
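These definitions correspond to the multi-class log loss shown above, which in LaTeX reads:

$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$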

image.png

Importing all the necessary metrics libraries:
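A sketch of the metric imports this section uses (assuming the standard scikit-learn metrics for accuracy, ROC AUC, and log loss):

```python
from sklearn.metrics import (accuracy_score, roc_auc_score, log_loss,
                             confusion_matrix, classification_report)
```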

Model Results

Naive Bayes

Logistic Regression

Random Forest

Compiled results

Here we see that the Random Forest Classifier has the highest training accuracy, 100%, but an accuracy like this carries the risk of overfitting. As for the logistic regression model, its training accuracy increased from 92% to 94% in this phase after implementing feature engineering and hyperparameter tuning; this score is good and appears to belong to a reliable model. The ROC Area Under Curve values for the Random Forest Classifier and the Logistic Regression model are 0.689 and 0.798, respectively; the higher value for Logistic Regression indicates a larger share of true positives and a better fit to the data. Our model cannot be based on Naive Bayes, since it appeared to underfit the data during phase 1, which is why we did not consider it for this phase. To confirm whether Random Forest is overfitting, we examine how the Logistic Regression model and the Random Forest Classifier behave on the test data. The test accuracy of the two models is exactly the same, so, given its more plausible training accuracy and higher ROC AUC, we consider Logistic Regression the better model.


Kaggle Submission

Machine learning competitions are a great way to improve your skills and measure your progress as a data scientist. If you are using data from a competition on Kaggle, you can easily submit predictions from your notebook. Submissions are made as CSV files and usually have two columns: an ID column and a prediction column. The ID field comes from the test data (keeping whatever name it had there, which for the Home Credit data is SK_ID_CURR), and the prediction column uses the name of the target field, TARGET.

The results for our project submission on Kaggle are as follows:

Note: We did not submit the file for Naive Bayes because the accuracy is way too low to begin with.

kaggle_logistic.jpeg

WhatsApp%20Image%202022-04-19%20at%2010.00.44%20PM.jpeg


Conclusions

We are attempting to predict whether the credit-less population will be able to repay their loans. To realize this goal, we sourced our data from the Home Credit dataset. Having a fair chance to obtain a loan is extremely important for this population, and as students we feel a strong connection to that, which is why we decided to pursue this project. During the first phase, we began to experiment with the dataset: after performing OHE on the data, we used imputation techniques to fill the missing values before feeding the data into the model.

We used both the Logistic Regression and Random Forest Classifier models in this stage. This was accomplished by cleaning the data, developing new features, identifying correlations with the target, and so on. In addition, we implemented grid search and cross-validation in the model pipelines to optimize the hyperparameters. The best model for Phase 2 came from Logistic Regression, with a training accuracy of 94.37%, a test accuracy of 92.20%, and a Kaggle submission score of 0.73. Phase 2 was very important for increasing the accuracy, even by a little, and for fine-tuning the whole algorithm. As future scope, we will implement a deep learning model with a multitask loss function: a PyTorch MLP model for loan default classification. Additionally, the pipeline will include a regression model with at least one target value. We will finally build a multi-headed loan default system using Python's OOP API and a combined loss function.


References